Take Home Exercise 3

Take-home Exercise 3 will be similar to one of the prototype module prepared above in term of content but with the following differences:

In this take-home exercise, you are required to select one of the module of your proposed Geospatial Analytics Shiny Application and complete the following tasks:

Required R packages for First and Second Order Spatial Point Patterns

  1. tidyverse:

    • Purpose: A collection of packages for data manipulation and visualization, emphasizing a consistent and tidy data format.

    • Relevance to spatial order: Useful for cleaning and transforming spatial data into a structured format that facilitates analysis.

  2. knitr:

    • Purpose: Dynamic report generation in R, allowing integration of R code and results into documents.

    • Relevance to spatial order: Enables the creation of documents that include spatial analysis results, making it easier to communicate and share findings.

  3. rgdal:

    • Purpose: Provides bindings to the GDAL (Geospatial Data Abstraction Library) for reading and writing geospatial data formats.

    • Relevance to spatial order: Essential for handling spatial data in various formats, ensuring interoperability and compatibility.

  4. maptools:

    • Purpose: Tools for reading and handling spatial objects, particularly shapefiles.

    • Relevance to spatial order: Facilitates the manipulation and analysis of spatial data, especially when dealing with shapefiles.

  5. sf:

    • Purpose: Supports simple features for spatial data representation and manipulation.

    • Relevance to spatial order: Offers a modern and efficient way to handle spatial data, including points, lines, and polygons.

  6. raster:

    • Purpose: Deals with gridded spatial data, such as satellite imagery or climate data.

    • Relevance to spatial order: Useful for working with raster data, which involves spatial information arranged in a grid.

  7. spatstat:

    • Purpose: Analyzing spatial point patterns and processes.

    • Relevance to spatial order: Particularly focused on the first point spatial order, addressing the distribution and arrangement of individual points in space.

  8. tmap:

    • Purpose: Creates thematic maps for visualizing spatial data.

    • Relevance to spatial order: Helps in visually exploring and interpreting spatial patterns in data.

  9. tmaptools:

    • Purpose: Provides additional tools for working with thematic maps.

    • Relevance to spatial order: Complements tmap by offering additional functionalities for spatial data visualization.

  10. gridExtra:

    • Purpose: Extends the grid graphics system to arrange multiple grid-based figures on one page.

    • Relevance to spatial order: Useful for creating complex layouts when visualizing multiple spatial plots or maps.

  11. leaflet:

    • Purpose: Creates interactive web maps using JavaScript.

    • Relevance to spatial order: Enables the development of interactive maps, enhancing the exploration of spatial relationships and patterns.

  12. OpenStreetMap:

    • Purpose: Accesses and interacts with OpenStreetMap data.

    • Relevance to spatial order: Allows integration of OpenStreetMap data into spatial analysis, providing additional context to spatial patterns.

  13. ggstatsplot:

    • Purpose: Enhances ggplot2 with statistical summaries and plots.

    • Relevance to spatial order: Can be used for adding statistical summaries to spatial visualizations created with ggplot2.

  14. statsExpressions:

    • Purpose: Creates customizable expressions for statistical summaries.

    • Relevance to spatial order: Useful for customizing the presentation of statistical information in spatial analysis results.

  15. lubridate:

    • Purpose: Facilitates the manipulation of date-time objects.

    • Relevance to spatial order: Helps in handling temporal aspects of spatial data, which is crucial in understanding how spatial patterns change over time.

pacman::p_load(tidyverse, knitr, rgdal, maptools, sf,raster,spatstat, tmap,tmaptools, gridExtra, leaflet, OpenStreetMap, ggstatsplot, statsExpressions, lubridate, raster, gridExtra, skimr, rstantools, rgeos)
Installing package into 'C:/Users/felie/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)
Warning: package 'rgeos' is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/4.3:
  cannot open URL 'http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/4.3/PACKAGES'
Warning in p_install(package, character.only = TRUE, ...):
Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, : there is no package called 'rgeos'
Warning in pacman::p_load(tidyverse, knitr, rgdal, maptools, sf, raster, : Failed to install/load:
rgeos

Data that is available from Airbnb

InsideAirbnb provides a snapshot of the following information:

  • Listings - Summary information on listings

  • Detailed Listings - Detailed listing information of airbnb for rent

  • Calendar - Detailed calendar data for listings

  • Reviews - Summary review data

  • Detailed Reviews - Detailed review data for listings

  • Neighbourhoods - list of neighbourhoods in the city and a neighbourhood GeoJSON file

Readind data from Airbnb

listings <- read_csv("data/listings.csv")
Rows: 3457 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr   (6): name, host_name, neighbourhood_group, neighbourhood, room_type, l...
dbl  (11): id, host_id, latitude, longitude, price, minimum_nights, number_o...
date  (1): last_review

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_listings <- read_csv("data/detailedlistings.csv")
Rows: 3457 Columns: 75
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (25): listing_url, source, name, neighborhood_overview, picture_url, ho...
dbl  (36): id, scrape_id, host_id, host_listings_count, host_total_listings_...
lgl   (9): description, host_is_superhost, host_has_profile_pic, host_identi...
date  (5): last_scraped, host_since, calendar_last_scraped, first_review, la...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
calendar <- read_csv("data/calendar.csv")
Rows: 1261786 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): price
dbl  (3): listing_id, minimum_nights, maximum_nights
lgl  (2): available, adjusted_price
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
reviews <- read_csv("data/reviews.csv")
Rows: 36905 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (1): listing_id
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_reviews <- read_csv("data/detailedreviews.csv")
Rows: 36905 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): reviewer_name, comments
dbl  (3): listing_id, id, reviewer_id
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
neighbourhoods <- read_csv("data/neighbourhoods.csv")
Rows: 55 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): neighbourhood_group, neighbourhood

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Reading Spatial data from Airbnb

nhood_map_sf <- st_read(dsn = "data/neighbourhoods.geojson", 
                        layer="neighbourhoods") %>%
                st_transform(crs = 3414)
Reading layer `neighbourhoods' from data source 
  `C:\Feliciaeng29\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\neighbourhoods.geojson' 
  using driver `GeoJSON'
Simple feature collection with 55 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6054 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84

Data Cleaning

glimpse(listings)
Rows: 3,457
Columns: 18
$ id                             <dbl> 71609, 71896, 71903, 275343, 275344, 28…
$ name                           <chr> "Villa in Singapore · ★4.44 · 2 bedroom…
$ host_id                        <dbl> 367042, 367042, 367042, 1439258, 143925…
$ host_name                      <chr> "Belinda", "Belinda", "Belinda", "Kay",…
$ neighbourhood_group            <chr> "East Region", "East Region", "East Reg…
$ neighbourhood                  <chr> "Tampines", "Tampines", "Tampines", "Bu…
$ latitude                       <dbl> 1.34537, 1.34754, 1.34531, 1.29015, 1.2…
$ longitude                      <dbl> 103.9589, 103.9596, 103.9610, 103.8081,…
$ room_type                      <chr> "Private room", "Private room", "Privat…
$ price                          <dbl> 150, 80, 80, 64, 78, 220, 85, 75, 69, 7…
$ minimum_nights                 <dbl> 92, 92, 92, 60, 60, 92, 92, 60, 60, 92,…
$ number_of_reviews              <dbl> 19, 24, 46, 20, 16, 12, 131, 17, 5, 81,…
$ last_review                    <date> 2020-01-17, 2019-10-13, 2020-01-09, 20…
$ reviews_per_month              <dbl> 0.13, 0.16, 0.30, 0.15, 0.11, 0.09, 0.9…
$ calculated_host_listings_count <dbl> 5, 5, 5, 51, 51, 5, 7, 51, 51, 7, 7, 1,…
$ availability_365               <dbl> 55, 91, 91, 183, 183, 54, 365, 183, 183…
$ number_of_reviews_ltm          <dbl> 0, 0, 0, 0, 3, 0, 0, 1, 2, 0, 0, 0, 0, …
$ license                        <chr> NA, NA, NA, "S0399", "S0399", NA, NA, "…
glimpse(reviews)
Rows: 36,905
Columns: 2
$ listing_id <dbl> 71609, 71609, 71609, 71609, 71609, 71609, 71609, 71609, 716…
$ date       <date> 2011-12-19, 2012-07-17, 2012-09-01, 2012-09-04, 2013-01-02…
glimpse(d_reviews)
Rows: 36,905
Columns: 6
$ listing_id    <dbl> 71609, 71609, 71609, 71609, 71609, 71609, 71609, 71609, …
$ id            <dbl> 793880, 1731810, 2162194, 2190615, 3221837, 10071818, 15…
$ date          <date> 2011-12-19, 2012-07-17, 2012-09-01, 2012-09-04, 2013-01…
$ reviewer_id   <dbl> 1456140, 1804182, 3113461, 1432123, 2759938, 11319720, 1…
$ reviewer_name <chr> "Max", "Zac", "Zahra", "Helmut", "Jack", "Emily", "Steve…
$ comments      <chr> "The rooms were clean and tidy. Beds very comfortable.\r…
glimpse(calendar)
Rows: 1,261,786
Columns: 7
$ listing_id     <dbl> 71609, 71609, 71609, 71609, 71609, 71609, 71609, 71609,…
$ date           <date> 2023-12-27, 2023-12-28, 2023-12-29, 2023-12-30, 2023-1…
$ available      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T…
$ price          <chr> "$108.00", "$108.00", "$108.00", "$108.00", "$108.00", …
$ adjusted_price <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ minimum_nights <dbl> 92, 92, 92, 92, 92, 92, 92, 92, 92, 92, 92, 92, 92, 92,…
$ maximum_nights <dbl> 1125, 1125, 1125, 1125, 1125, 1125, 1125, 1125, 1125, 1…
glimpse(d_listings)
Rows: 3,457
Columns: 75
$ id                                           <dbl> 71609, 71896, 71903, 2753…
$ listing_url                                  <chr> "https://www.airbnb.com/r…
$ scrape_id                                    <dbl> 2.023123e+13, 2.023123e+1…
$ last_scraped                                 <date> 2023-12-27, 2023-12-26, …
$ source                                       <chr> "previous scrape", "city …
$ name                                         <chr> "Villa in Singapore · ★4.…
$ description                                  <lgl> NA, NA, NA, NA, NA, NA, N…
$ neighborhood_overview                        <chr> NA, NA, "Quiet and view o…
$ picture_url                                  <chr> "https://a0.muscache.com/…
$ host_id                                      <dbl> 367042, 367042, 367042, 1…
$ host_url                                     <chr> "https://www.airbnb.com/u…
$ host_name                                    <chr> "Belinda", "Belinda", "Be…
$ host_since                                   <date> 2011-01-29, 2011-01-29, …
$ host_location                                <chr> "Singapore", "Singapore",…
$ host_about                                   <chr> "Hi My name is Belinda -H…
$ host_response_time                           <chr> "N/A", "N/A", "N/A", "wit…
$ host_response_rate                           <chr> "N/A", "N/A", "N/A", "100…
$ host_acceptance_rate                         <chr> "100%", "100%", "100%", "…
$ host_is_superhost                            <lgl> FALSE, FALSE, FALSE, FALS…
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/…
$ host_picture_url                             <chr> "https://a0.muscache.com/…
$ host_neighbourhood                           <chr> "Tampines", "Tampines", "…
$ host_listings_count                          <dbl> 5, 5, 5, 51, 51, 5, 7, 51…
$ host_total_listings_count                    <dbl> 15, 15, 15, 68, 68, 15, 8…
$ host_verifications                           <chr> "['email', 'phone']", "['…
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ host_identity_verified                       <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ neighbourhood                                <chr> NA, NA, "Singapore, Singa…
$ neighbourhood_cleansed                       <chr> "Tampines", "Tampines", "…
$ neighbourhood_group_cleansed                 <chr> "East Region", "East Regi…
$ latitude                                     <dbl> 1.34537, 1.34754, 1.34531…
$ longitude                                    <dbl> 103.9589, 103.9596, 103.9…
$ property_type                                <chr> "Private room in villa", …
$ room_type                                    <chr> "Private room", "Private …
$ accommodates                                 <dbl> 3, 1, 2, 1, 1, 4, 2, 1, 1…
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N…
$ bathrooms_text                               <chr> "1 private bath", "Shared…
$ bedrooms                                     <lgl> NA, NA, NA, NA, NA, NA, N…
$ beds                                         <dbl> 3, 1, 2, 1, 1, 5, 1, 1, 1…
$ amenities                                    <chr> "[]", "[]", "[]", "[]", "…
$ price                                        <chr> "$150.00", "$80.00", "$80…
$ minimum_nights                               <dbl> 92, 92, 92, 60, 60, 92, 9…
$ maximum_nights                               <dbl> 365, 365, 365, 999, 999, …
$ minimum_minimum_nights                       <dbl> 92, 92, 92, 60, 60, 92, 9…
$ maximum_minimum_nights                       <dbl> 92, 92, 92, 60, 60, 92, 9…
$ minimum_maximum_nights                       <dbl> 1125, 1125, 1125, 1125, 1…
$ maximum_maximum_nights                       <dbl> 1125, 1125, 1125, 1125, 1…
$ minimum_nights_avg_ntm                       <dbl> 92, 92, 92, 60, 60, 92, 9…
$ maximum_nights_avg_ntm                       <dbl> 1125, 1125, 1125, 1125, 1…
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N…
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ availability_30                              <dbl> 30, 30, 30, 6, 6, 29, 30,…
$ availability_60                              <dbl> 34, 60, 60, 6, 6, 33, 60,…
$ availability_90                              <dbl> 55, 90, 90, 6, 6, 54, 90,…
$ availability_365                             <dbl> 55, 91, 91, 183, 183, 54,…
$ calendar_last_scraped                        <date> 2023-12-27, 2023-12-26, …
$ number_of_reviews                            <dbl> 19, 24, 46, 20, 16, 12, 1…
$ number_of_reviews_ltm                        <dbl> 0, 0, 0, 0, 3, 0, 0, 1, 2…
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ first_review                                 <date> 2011-12-19, 2011-07-30, …
$ last_review                                  <date> 2020-01-17, 2019-10-13, …
$ review_scores_rating                         <dbl> 4.44, 4.16, 4.41, 4.40, 4…
$ review_scores_accuracy                       <dbl> 4.37, 4.22, 4.39, 4.16, 4…
$ review_scores_cleanliness                    <dbl> 4.00, 4.09, 4.52, 4.26, 4…
$ review_scores_checkin                        <dbl> 4.63, 4.43, 4.63, 4.47, 4…
$ review_scores_communication                  <dbl> 4.78, 4.43, 4.64, 4.42, 4…
$ review_scores_location                       <dbl> 4.26, 4.17, 4.50, 4.53, 4…
$ review_scores_value                          <dbl> 4.32, 4.04, 4.36, 4.63, 4…
$ license                                      <chr> NA, NA, NA, "S0399", "S03…
$ instant_bookable                             <lgl> FALSE, FALSE, FALSE, TRUE…
$ calculated_host_listings_count               <dbl> 5, 5, 5, 51, 51, 5, 7, 51…
$ calculated_host_listings_count_entire_homes  <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0…
$ calculated_host_listings_count_private_rooms <dbl> 5, 5, 5, 51, 51, 5, 6, 51…
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reviews_per_month                            <dbl> 0.13, 0.16, 0.30, 0.15, 0…

Change the data type of id to characters

listings <- listings %>% mutate_at(vars(id, host_id), as.character)
reviews <- reviews %>% mutate_at(vars(listing_id), as.character)
d_reviews <- d_reviews %>% mutate_at(vars(id, reviewer_id, listing_id), as.character)
calendar <- calendar %>% mutate_at(vars(listing_id), as.character)
d_listings <- d_listings %>% mutate_at(vars(host_id, id), as.character)

Change price in detailed listings to numerical & remove $ and, symbol in cloumns where currency is read as character.

strip_dollars = function(x) {as.numeric(gsub("[\\$,]", "", x))}
d_listings[,61:65] <- sapply(d_listings[,61:65], strip_dollars)
Warning in FUN(X[[i]], ...): NAs introduced by coercion
d_listings[,67] <- sapply(d_listings[,67], strip_dollars)

Check for Missing Data

We check if there is any missing data and decide how to handle it. Missing data will include zero value, which we need to change to be able to analyse correctly.

Listings data

skim(listings)
Data summary
Name listings
Number of rows 3457
Number of columns 18
_______________________
Column type frequency:
character 8
Date 1
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1.0 5 19 0 3457 0
name 0 1.0 31 81 0 1427 0
host_id 0 1.0 5 9 0 952 0
host_name 0 1.0 1 32 0 761 0
neighbourhood_group 0 1.0 11 17 0 5 0
neighbourhood 0 1.0 4 23 0 44 0
room_type 0 1.0 10 15 0 4 0
license 2070 0.4 5 29 0 112 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_review 1611 0.53 2014-06-28 2023-12-27 2022-05-07 999

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
latitude 0 1.00 1.31 0.03 1.22 1.29 1.30 1.32 1.46 ▁▇▂▁▁
longitude 0 1.00 103.84 0.04 103.63 103.83 103.85 103.86 103.99 ▁▁▇▇▁
price 128 0.96 270.36 528.21 13.00 80.00 175.00 289.00 10286.00 ▇▁▁▁▁
minimum_nights 0 1.00 65.81 67.21 1.00 6.00 92.00 92.00 1000.00 ▇▁▁▁▁
number_of_reviews 0 1.00 10.68 34.38 0.00 0.00 1.00 5.00 757.00 ▇▁▁▁▁
reviews_per_month 1611 0.53 0.59 1.34 0.01 0.05 0.18 0.63 29.62 ▇▁▁▁▁
calculated_host_listings_count 0 1.00 47.25 72.12 1.00 2.00 11.00 62.00 260.00 ▇▂▁▁▁
availability_365 0 1.00 248.32 139.21 0.00 150.00 329.00 364.00 365.00 ▃▁▁▂▇
number_of_reviews_ltm 0 1.00 2.39 12.20 0.00 0.00 0.00 0.00 396.00 ▇▁▁▁▁

Calender

skim(calendar)
Data summary
Name calendar
Number of rows 1261786
Number of columns 7
_______________________
Column type frequency:
character 2
Date 1
logical 2
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_id 0 1 5 19 0 3457 0
price 0 1 6 10 0 483 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2023-12-26 2024-12-25 2024-06-26 366

Variable type: logical

skim_variable n_missing complete_rate mean count
available 0 1 0.69 TRU: 867321, FAL: 394465
adjusted_price 1261786 0 NaN :

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
minimum_nights 3 1 72.93 81.61 1 6 92 92 1e+03 ▇▁▁▁▁
maximum_nights 3 1 880.19 1826.91 1 365 1125 1125 1e+05 ▇▁▁▁▁

Reviews and Detailed Reviews

skim(reviews)
Data summary
Name reviews
Number of rows 36905
Number of columns 2
_______________________
Column type frequency:
character 1
Date 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_id 0 1 5 19 0 1846 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2011-05-04 2023-12-27 2019-12-16 3650
skim(d_reviews)
Data summary
Name d_reviews
Number of rows 36905
Number of columns 6
_______________________
Column type frequency:
character 5
Date 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_id 0 1 5 19 0 1846 0
id 0 1 6 19 0 36905 0
reviewer_id 0 1 5 9 0 34123 0
reviewer_name 0 1 1 34 0 16055 0
comments 5 1 1 5472 0 35192 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2011-05-04 2023-12-27 2019-12-16 3650

Detailed Listings

skim(d_listings)
Data summary
Name d_listings
Number of rows 3457
Number of columns 75
_______________________
Column type frequency:
character 27
Date 4
logical 9
numeric 35
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1.00 5 19 0 3457 0
listing_url 0 1.00 34 48 0 3457 0
source 0 1.00 11 15 0 2 0
name 0 1.00 31 81 0 1427 0
neighborhood_overview 1214 0.65 3 1000 0 1011 0
picture_url 0 1.00 61 156 0 3253 0
host_id 0 1.00 5 9 0 952 0
host_url 0 1.00 39 43 0 952 0
host_name 0 1.00 1 32 0 761 0
host_location 1296 0.63 5 32 0 44 0
host_about 1206 0.65 1 2315 0 446 1
host_response_time 0 1.00 3 18 0 5 0
host_response_rate 0 1.00 2 4 0 36 0
host_acceptance_rate 0 1.00 2 4 0 52 0
host_thumbnail_url 0 1.00 55 131 0 917 0
host_picture_url 0 1.00 57 134 0 917 0
host_neighbourhood 224 0.94 4 18 0 59 0
host_verifications 0 1.00 2 32 0 7 0
neighbourhood 1214 0.65 9 39 0 36 0
neighbourhood_cleansed 0 1.00 4 23 0 44 0
neighbourhood_group_cleansed 0 1.00 11 17 0 5 0
property_type 0 1.00 4 34 0 51 0
room_type 0 1.00 10 15 0 4 0
bathrooms_text 19 0.99 6 17 0 39 0
amenities 0 1.00 2 2 0 1 0
price 128 0.96 6 10 0 626 0
license 2070 0.40 5 29 0 112 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2023-12-26 2023-12-27 2023-12-27 2
host_since 0 1.00 2009-06-29 2023-12-19 2017-01-03 818
calendar_last_scraped 0 1.00 2023-12-26 2023-12-27 2023-12-27 2
first_review 1611 0.53 2011-05-04 2023-12-13 2019-12-28 1308

Variable type: logical

skim_variable n_missing complete_rate mean count
description 3457 0.00 NaN :
host_is_superhost 4 1.00 0.11 FAL: 3080, TRU: 373
host_has_profile_pic 0 1.00 0.98 TRU: 3399, FAL: 58
host_identity_verified 0 1.00 0.89 TRU: 3094, FAL: 363
bathrooms 3457 0.00 NaN :
bedrooms 3457 0.00 NaN :
calendar_updated 3457 0.00 NaN :
has_availability 128 0.96 0.99 TRU: 3296, FAL: 33
instant_bookable 0 1.00 0.37 FAL: 2182, TRU: 1275

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
scrape_id 0 1.00 2.023123e+13 0.00 2.023123e+13 2.023123e+13 2.023123e+13 2.023123e+13 2.023123e+13 ▁▁▇▁▁
host_listings_count 0 1.00 8.004000e+01 156.12 1.000000e+00 3.000000e+00 1.100000e+01 6.200000e+01 5.800000e+02 ▇▁▁▁▁
host_total_listings_count 0 1.00 1.299400e+02 243.48 1.000000e+00 5.000000e+00 1.700000e+01 9.600000e+01 8.750000e+02 ▇▁▁▁▁
latitude 0 1.00 1.310000e+00 0.03 1.220000e+00 1.290000e+00 1.300000e+00 1.320000e+00 1.460000e+00 ▁▇▂▁▁
longitude 0 1.00 1.038400e+02 0.04 1.036300e+02 1.038300e+02 1.038500e+02 1.038600e+02 1.039900e+02 ▁▁▇▇▁
accommodates 0 1.00 2.810000e+00 2.21 1.000000e+00 1.000000e+00 2.000000e+00 4.000000e+00 1.600000e+01 ▇▁▁▁▁
beds 82 0.98 1.800000e+00 2.08 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 4.600000e+01 ▇▁▁▁▁
minimum_nights 0 1.00 6.581000e+01 67.21 1.000000e+00 6.000000e+00 9.200000e+01 9.200000e+01 1.000000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 7.816100e+02 1740.04 2.000000e+00 3.650000e+02 1.124000e+03 1.125000e+03 1.000000e+05 ▇▁▁▁▁
minimum_minimum_nights 0 1.00 6.576000e+01 67.53 1.000000e+00 6.000000e+00 9.200000e+01 9.200000e+01 1.000000e+03 ▇▁▁▁▁
maximum_minimum_nights 0 1.00 7.378000e+01 83.02 1.000000e+00 6.000000e+00 9.200000e+01 9.200000e+01 1.000000e+03 ▇▁▁▁▁
minimum_maximum_nights 0 1.00 8.779800e+02 1827.50 1.000000e+00 3.650000e+02 1.125000e+03 1.125000e+03 1.000000e+05 ▇▁▁▁▁
maximum_maximum_nights 0 1.00 8.883600e+02 1825.49 1.000000e+00 3.650000e+02 1.125000e+03 1.125000e+03 1.000000e+05 ▇▁▁▁▁
minimum_nights_avg_ntm 0 1.00 7.291000e+01 81.28 1.000000e+00 6.000000e+00 9.200000e+01 9.200000e+01 1.000000e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 0 1.00 8.801600e+02 1827.00 1.000000e+00 3.650000e+02 1.125000e+03 1.125000e+03 1.000000e+05 ▇▁▁▁▁
availability_30 0 1.00 1.782000e+01 12.60 0.000000e+00 0.000000e+00 2.400000e+01 2.900000e+01 3.000000e+01 ▅▁▁▂▇
availability_60 0 1.00 3.788000e+01 24.75 0.000000e+00 6.000000e+00 5.300000e+01 5.900000e+01 6.000000e+01 ▃▁▁▁▇
availability_90 0 1.00 5.873000e+01 36.32 0.000000e+00 2.100000e+01 8.000000e+01 8.900000e+01 9.000000e+01 ▃▁▁▁▇
availability_365 0 1.00 2.483200e+02 139.21 0.000000e+00 1.500000e+02 3.290000e+02 3.640000e+02 3.650000e+02 ▃▁▁▂▇
number_of_reviews 0 1.00 1.068000e+01 34.38 0.000000e+00 0.000000e+00 1.000000e+00 5.000000e+00 7.570000e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 2.390000e+00 12.20 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.960000e+02 ▇▁▁▁▁
number_of_reviews_l30d 0 1.00 1.700000e-01 1.08 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 3.800000e+01 ▇▁▁▁▁
last_review 3457 0.00 NaN NA NA NA NA NA NA
review_scores_rating 1611 0.53 4.540000e+00 0.62 1.000000e+00 4.360000e+00 4.710000e+00 5.000000e+00 5.000000e+00 ▁▁▁▂▇
review_scores_accuracy 1613 0.53 4.580000e+00 0.62 1.000000e+00 4.450000e+00 4.780000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_cleanliness 1613 0.53 4.500000e+00 0.64 1.000000e+00 4.280000e+00 4.670000e+00 5.000000e+00 5.000000e+00 ▁▁▁▂▇
review_scores_checkin 1613 0.53 4.730000e+00 0.50 1.000000e+00 4.670000e+00 4.910000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_communication 1612 0.53 4.710000e+00 0.56 1.000000e+00 4.670000e+00 4.910000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_location 1613 0.53 4.680000e+00 0.48 1.000000e+00 4.560000e+00 4.830000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_value 1613 0.53 4.440000e+00 0.64 1.000000e+00 4.230000e+00 4.590000e+00 4.940000e+00 5.000000e+00 ▁▁▁▂▇
calculated_host_listings_count 0 1.00 4.725000e+01 72.12 1.000000e+00 2.000000e+00 1.100000e+01 6.200000e+01 2.600000e+02 ▇▂▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 3.720000e+01 73.83 0.000000e+00 0.000000e+00 1.000000e+00 2.700000e+01 2.600000e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 9.340000e+00 20.08 0.000000e+00 0.000000e+00 1.000000e+00 6.000000e+00 9.100000e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 3.700000e-01 1.93 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 1.800000e+01 ▇▁▁▁▁
reviews_per_month 1611 0.53 5.900000e-01 1.34 1.000000e-02 5.000000e-02 1.800000e-01 6.300000e-01 2.962000e+01 ▇▁▁▁▁

Data Wrangling

We need to convert the listings data into sf object.

listings_sf <- listings %>% 
                st_as_sf(coords = c("longitude", "latitude"),
                         crs = 4326) %>%
                st_transform(crs = 3414)

Display the 10 records and show the geometry type and we can check that projected CRIS is svy21.

head(listings_sf)
Simple feature collection with 6 features and 16 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 25197.84 ymin: 30085.83 xmax: 42209.55 ymax: 36630.01
Projected CRS: SVY21 / Singapore TM
# A tibble: 6 × 17
  id     name      host_id host_name neighbourhood_group neighbourhood room_type
  <chr>  <chr>     <chr>   <chr>     <chr>               <chr>         <chr>    
1 71609  Villa in… 367042  Belinda   East Region         Tampines      Private …
2 71896  Home in … 367042  Belinda   East Region         Tampines      Private …
3 71903  Home in … 367042  Belinda   East Region         Tampines      Private …
4 275343 Rental u… 1439258 Kay       Central Region      Bukit Merah   Private …
5 275344 Rental u… 1439258 Kay       Central Region      Bukit Merah   Private …
6 289234 Home in … 367042  Belinda   East Region         Tampines      Private …
# ℹ 10 more variables: price <dbl>, minimum_nights <dbl>,
#   number_of_reviews <dbl>, last_review <date>, reviews_per_month <dbl>,
#   calculated_host_listings_count <dbl>, availability_365 <dbl>,
#   number_of_reviews_ltm <dbl>, license <chr>, geometry <POINT [m]>

Show the point details of the geometry column, givin gthe x, y coordianted in SVY21

glimpse(listings_sf)
Rows: 3,457
Columns: 17
$ id                             <chr> "71609", "71896", "71903", "275343", "2…
$ name                           <chr> "Villa in Singapore · ★4.44 · 2 bedroom…
$ host_id                        <chr> "367042", "367042", "367042", "1439258"…
$ host_name                      <chr> "Belinda", "Belinda", "Belinda", "Kay",…
$ neighbourhood_group            <chr> "East Region", "East Region", "East Reg…
$ neighbourhood                  <chr> "Tampines", "Tampines", "Tampines", "Bu…
$ room_type                      <chr> "Private room", "Private room", "Privat…
$ price                          <dbl> 150, 80, 80, 64, 78, 220, 85, 75, 69, 7…
$ minimum_nights                 <dbl> 92, 92, 92, 60, 60, 92, 92, 60, 60, 92,…
$ number_of_reviews              <dbl> 19, 24, 46, 20, 16, 12, 131, 17, 5, 81,…
$ last_review                    <date> 2020-01-17, 2019-10-13, 2020-01-09, 20…
$ reviews_per_month              <dbl> 0.13, 0.16, 0.30, 0.15, 0.11, 0.09, 0.9…
$ calculated_host_listings_count <dbl> 5, 5, 5, 51, 51, 5, 7, 51, 51, 7, 7, 1,…
$ availability_365               <dbl> 55, 91, 91, 183, 183, 54, 365, 183, 183…
$ number_of_reviews_ltm          <dbl> 0, 0, 0, 0, 3, 0, 0, 1, 2, 0, 0, 0, 0, …
$ license                        <chr> NA, NA, NA, "S0399", "S0399", NA, NA, "…
$ geometry                       <POINT [m]> POINT (41972.5 36390.05), POINT (…

Data Analysis

Type of Accomdation

Room types by region

Summarizing the types of listings by neighbourhood groups

regionlist <- listings %>%
              group_by(neighbourhood_group, room_type) %>%
              summarise(
                num_listings = n(),
                avg_price = mean(price),
                med_price = median(price))
`summarise()` has grouped output by 'neighbourhood_group'. You can override
using the `.groups` argument.

Plotting the type of accomdation by region

ggplot(regionlist, aes(x=room_type, fill = room_type)) + theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) + 
      geom_col(aes(y = num_listings)) +
      facet_grid(cols=vars(neighbourhood_group), margins = T, labeller = labeller(neighbourhood_group = label_wrap_gen(width = 5, multi_line = TRUE))) +
      labs(x = "", y = "No. of listings", fill = "Room Type")

Pricing of Room Types

Remove price outliers

outlier_price <- quantile(listings$price, 0.99, na.rm = TRUE)
listings_cleanprice <- listings %>% filter(price <= outlier_price)

Plot boxplot of prices for each room type

p1 <- ggplot(listings, aes(x=room_type, fill = room_type)) + 
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank(), legend.position = "None") + 
  geom_boxplot(aes(y=price)) +
  labs(x = "", y = "Listing Price", fill = "Room Type", title = "Listing price")

Cleaned pricing

p2 <- ggplot(listings_cleanprice, aes(x=room_type, fill = room_type)) + 
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) + 
  geom_boxplot(aes(y=price)) +
  labs(x = "", y = "Listing Price", fill = "Room Type", title = "Listing price (outliers removed)")

grid.arrange(p1, p2, nrow = 1)
Warning: Removed 128 rows containing non-finite outside the scale range
(`stat_boxplot()`).

Room Types by Price and Region

ggplot(listings_cleanprice, aes(x=neighbourhood_group, fill = neighbourhood_group)) + 
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  geom_boxplot(aes(y=price)) +
  facet_grid(~room_type, labeller = labeller(room_type = label_wrap_gen(width = 5, multi_line = TRUE))) +
  labs(x = "", y = "Listing Price", fill = "Region")

Deductive Data Analysis

We utilize ggstatsplot to conduct confirmatory data analysis on prices across various neighborhoods. Our null hypothesis posits that there is no difference in listing prices among different neighborhood groups. We reject this hypothesis if the p-value exceeds 0.05 at a 95% significance level. Initially, we extract only the necessary variables (neighbourhood_group, room_type, and price) and feed them into the ggbetweenstats() function. This function generates the p-value and a violin plot illustrating listing prices categorized by neighborhood group. Additionally, it computes pairwise comparisons between the grouped variables, displaying only the significant comparisons along with their respective p-values.

conftest <- listings_cleanprice %>% dplyr::select(neighbourhood_group, room_type, price)
ggbetweenstats(
  data = conftest,
  x = neighbourhood_group,
  y = price
)
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf

conftest2 <- conftest %>% filter(room_type != "Shared room", room_type != "Hotel room")
grouped_ggbetweenstats(
  data = conftest2,
  x = neighbourhood_group,
  y = price,
  grouping.var = room_type,
  ggsignif.args = list(textsize = 4, tip_length = 0.01),
  p.adjust.method = "bonferroni", # method for adjusting p-values for multiple comparisons
  # adding new components to `ggstatsplot` default
  ggplot.component = list(ggplot2::scale_y_continuous(sec.axis = ggplot2::dup_axis())),
  k = 3,
  title.prefix = "Room Type",
  palette = "default_jama",
  package = "ggsci",
  plotgrid.args = list(nrow = 2),
  title.text = "Differences in listing prices by neighbourhoods for different room types"
)
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

sharedrooms <- conftest %>% filter(room_type == "Shared room")
 ggbetweenstats(
   data = sharedrooms,
   x = neighbourhood_group,
   y = price
)
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf

Hosts & Listings

Host with multiple listings

Create a table of percetnage of hosts by number of listings

list_byhost <- listings %>%
                group_by(host_id, host_name) %>%
                count(name = "number_of_listings", sort = TRUE) %>%
                ungroup() %>%
                group_by(number_of_listings) %>%
                count(name = "number_of_hosts")

Plot above table

ggplot(list_byhost, aes(x=number_of_listings, y= number_of_hosts/sum(number_of_hosts)*100)) +
  geom_point() +
  labs(y="Percentage of hosts", title = "% of hosts vs number of listings", x = "number of listings")

To explore the whether the market is dominated by hosts with single or multiple listings, we plot a Pareto chart by adding the cumulative frequency of the number of listings of hosts, to a descending list of airbnb rentals by host. We see that almost 75% of the Airbnb ‘stock’ are taken up by hosts with multiple listings.

There are currently no regulations or legislation around the number of listings that a host may have, unlike in other cities that regulate the maximum number of listings (e.g. New York). People with more than 1 listing are likely to be agents managing these properties on behalf of landlords.

However, there are restrictions on the minimum stay for short term rentals: 3 months for private housing and 6 months for HDB flats. This means that typical tourist stays (e.g. 2-7 days) in Airbnb listings would technically be illegal. Despite this, 89% of listings have a minimum stay of less than the legislated minimum stay.

Create Pareto Chart

list_byhost <- list_byhost[order(list_byhost$number_of_listings, decreasing = TRUE),]
list_byhost$number_of_listings1 <- factor(list_byhost$number_of_listings, levels = list_byhost$number_of_listings)
list_byhost$listfreq <- list_byhost$number_of_hosts * list_byhost$number_of_listings
list_byhost$cumul <- cumsum(list_byhost$listfreq)
nr <- nrow(list_byhost)
N <- sum(list_byhost$listfreq)
y2 <- c("  0%", " 10%", " 20%", " 30%", " 40%", " 50%", " 60%", " 70%", " 80%", " 90%", "100%")
ggplot(list_byhost, aes(x=number_of_listings1)) +
  geom_bar(aes(y=number_of_hosts), fill = "blue", stat = "identity") +
  geom_point(aes(x=number_of_listings1, y=cumul)) +
  geom_line(aes(x=number_of_listings1, y=cumul)) +
  geom_path(aes(y=cumul, group=1)) +
  labs(y="Frequency", title = "Pareto Chart of hosts and listings", x = "No. of listings") +
  theme(plot.margin = margin(c(1,1,1,1), unit="cm"), axis.text.x = element_text(angle=90, vjust=0.6)) +
  annotate("text", x = nr + 3, y = seq(0, N, N/10), label = y2, size = 3.5, hjust = "inward")
`geom_line()`: Each group consists of only one observation.
ℹ Do you need to adjust the group aesthetic?

Distribution of host type and room type

Create column with single or multiple host types

listings <- listings %>% mutate(host_type = ifelse(calculated_host_listings_count ==1, "Single", "Multiple"))
mosaicplot(listings$room_type ~ listings$host_type, color = c("steelblue", "wheat"), xlab = "Room Type", ylab = "Host Type", main = "Mosaic Plot of Room Type and Host Type")

Most hosts with Single listings mainly offer private rooms, followed by shared rooms. This suggests they’re likely renting out their extra or shared bedrooms to earn extra money. On the other hand, hosts offering entire homes or apartments might own investment properties or be away during the rental period. There are also a few hosts with single hotel room listings, which could be because of mistakes in the room type or because they offer special or boutique accommodations. Usually, hotel operators have multiple listings, so these single hotel room listings might be unique.

Analysis of price of room types by single or multiple hosts

# Set seed for reproducibility
set.seed(123)
test2 <- listings %>% dplyr::select(room_type, host_type, neighbourhood_group, price)
grouped_ggbetweenstats(
  data = test2,
  x = host_type,
  y = price,
  grouping.var = room_type,
  ggsignif.args = list(textsize = 4, tip_length = 0.01),
  p.adjust.method = "bonferroni", # method for adjusting p-values for multiple comparisons
  # adding new components to `ggstatsplot` default
  ggplot.component = list(ggplot2::scale_y_continuous(sec.axis = ggplot2::dup_axis())),
  k = 3,
  title.prefix = "Room Type",
  palette = "default_jama",
  package = "ggsci",
  plotgrid.args = list(nrow = 2),
  title.text = "Differences in listing prices for single/multiple hosts by different room types"
)
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.
Scale for y is already present.
Adding another scale for y, which will replace the existing scale.

grouped_ggbetweenstats(
  data = test2 %>% filter(room_type != "Hotel room"),
  x = host_type,
  y = price,
  grouping.var = neighbourhood_group,
  ggsignif.args = list(textsize = 4, tip_length = 0.01),
  p.adjust.method = "bonferroni", # method for adjusting p-values for multiple comparisons
  # adding new components to `ggstatsplot` default
  # ggplot.component = list(ggplot2::scale_y_continuous(sec.axis = ggplot2::dup_axis())),
  # k = 3,
  title.prefix = "Room Type",
  palette = "default_jama",
  package = "ggsci",
  # plotgrid.args = list(nrow = 2),
  title.text = "Differences in listing prices for single/multiple hosts by different neighbourhoods", 
  output = "subtitle"
)
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf

Which year did the hosts join airbnb

host_byyear <- d_listings %>% dplyr::select(id, host_id, host_since) %>% group_by(year_joined = year(host_since)) %>% drop_na() %>% summarise(number_hosts = n()) %>% ungroup() %>% mutate(change = (number_hosts - lag(number_hosts)) / lag(number_hosts)*100)

ggplot(host_byyear, aes(x=year_joined, y = number_hosts)) + geom_bar(stat = "identity", fill = "steelblue") + 
   labs(title = "New hosts by year", y = "No. of hosts joined", x = "year") +
   geom_text(aes(label = number_hosts), vjust = -0.3)

Detailed Information on Hosts

How many listings do not have reviews

d_listings %>% filter(., host_acceptance_rate == "0%" | is.na(host_acceptance_rate)) %>% dplyr::select(id, host_acceptance_rate, number_of_reviews) %>% arrange(desc(number_of_reviews))
# A tibble: 97 × 3
   id       host_acceptance_rate number_of_reviews
   <chr>    <chr>                            <dbl>
 1 22051870 0%                                 159
 2 22745179 0%                                 145
 3 32411075 0%                                 141
 4 16502098 0%                                 140
 5 11228389 0%                                  84
 6 37484436 0%                                  69
 7 17684352 0%                                  47
 8 17949590 0%                                  47
 9 7321238  0%                                  34
10 32205289 0%                                  31
# ℹ 87 more rows

Reviews

Listings with no reviews

no_reviews <- listings_sf %>% filter(is.na(last_review))
host_join <- d_listings %>% dplyr::select(id, host_since)
no_reviews_host <- left_join(no_reviews, host_join, by = c("id")) %>% group_by(year_joined = year(host_since)) %>% summarise(number_hosts = n()) %>% drop_na()  %>% ungroup()
no_reviews_host$year_joined = as.character(no_reviews_host$year_joined)
ggplot(no_reviews_host, aes(x=year_joined, y = number_hosts)) + geom_bar(stat = "identity", fill = "steelblue4") + 
   labs(title = "Listings with no reviews - Hosts by year", y = "No. of hosts joined", x = "year") + geom_text(aes(label = number_hosts), vjust = -0.3) + scale_x_discrete(breaks = no_reviews_host$year_joined)

Review score

select relevant data for review scores

review_scores <- d_listings %>% dplyr::select(id, host_id, number_of_reviews, room_type, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_communication, review_scores_checkin, review_scores_location, review_scores_value, reviews_per_month)

Plot histogram of overall rating

ggplot(review_scores, aes(x=review_scores_rating)) + geom_histogram(aes(fill = room_type), bins = 20) + stat_bin(aes(label = ..count..), bins = 20, size = 3, geom= "text", vjust = -1)
Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
ℹ Please use `after_stat(count)` instead.
Warning: Removed 1611 rows containing non-finite outside the scale range (`stat_bin()`).
Removed 1611 rows containing non-finite outside the scale range (`stat_bin()`).

Dsitribution of overall review score

Plot histogram of attribute score

ggplot(gather(review_scores[, -c(1:5,12)], cols, value), aes(x = value)) + 
       geom_histogram(binwidth = 1) + facet_grid(.~cols)
Warning: Removed 9677 rows containing non-finite outside the scale range
(`stat_bin()`).

Spatial Distribution of Airbnb listings in Singapore

Handling Spatial Data Outliers

Identify listings that fall within the Water Cachement Area

filter(listings, neighbourhood == "Central Water Catchment")
# A tibble: 5 × 19
  id          name  host_id host_name neighbourhood_group neighbourhood latitude
  <chr>       <chr> <chr>   <chr>     <chr>               <chr>            <dbl>
1 36009935    Serv… 238891… Neha      North Region        Central Wate…     1.35
2 36852543    Cond… 156409… Tushita   North Region        Central Wate…     1.35
3 43424855    Plac… 346107… Chris     North Region        Central Wate…     1.35
4 46669109    Rent… 238891… Neha      North Region        Central Wate…     1.35
5 9148448679… Rent… 444964… Richard   North Region        Central Wate…     1.36
# ℹ 12 more variables: longitude <dbl>, room_type <chr>, price <dbl>,
#   minimum_nights <dbl>, number_of_reviews <dbl>, last_review <date>,
#   reviews_per_month <dbl>, calculated_host_listings_count <dbl>,
#   availability_365 <dbl>, number_of_reviews_ltm <dbl>, license <chr>,
#   host_type <chr>

Remove listings in the Central Water Catchment, Sungei Kadut and Mandai areas

listings_clean <- filter(listings_sf, !neighbourhood %in% c("Central Water Catchment", "Sungei Kadut", "Mandai", "Western Water Catchment")) %>% st_as_sf()

Mapping Airbnb listings in Singapore

Loading basemap raster with bounding box

Read in OSM raster of listings data for plot view and create bounding box

sg_osm <- tmaptools::read_osm(listings_clean, ext=1.3)
bb_sg_osm <- st_bbox(listings_clean, crs = 3414)

Listing by room types and neighbourhood

Plotting neighbourhood listings on tmap

st_is_valid(nhood_map_sf)
 [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
[13]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
[25]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
[37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
[49]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
nhood_map_sf <- st_make_valid(nhood_map_sf)
st_is_valid(nhood_map_sf)
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
class(nhood_map_sf)
[1] "sf"         "data.frame"
nhood_map_sf <- na.omit(nhood_map_sf)

Deleting cos after analyzing realise the geometries is invalid

nhood_map_sf <- nhood_map_sf[-c(28, 45), ]
tmap_mode("view")
tmap mode set to interactive viewing
# Plotting points
tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(nhood_map_sf) +
  tm_polygons(alpha = 0.3) +
tm_shape(listings_clean) +
  tm_symbols(col="room_type", size = 0.2) +
  tm_view(set.zoom.limits = c(11, 17)) +
  tm_facets(by="room_type") +
  tm_layout(legend.show = F)

Rental prices by room type

Removing price outliers

listings_sf_price <- listings_clean %>% filter(price <= outlier_price)
tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(sg_osm, bbox=bb_sg_osm) +
  tm_rgb() +
tm_shape(nhood_map_sf) +
  tm_polygons(alpha = 0.3) +
  tm_shape(listings_sf_price) +
  tm_symbols(col = "price", size = 0.2, palette = "YlOrBr", legend.hist = TRUE) +
  # tm_view(set.zoom.limits = c(11, 18)) +
  tm_facets(by="room_type") +
  tm_layout(legend.outside = TRUE, legend.outside.position = "bottom", legend.stack = "horizontal", legend.hist.height = 1, legend.hist.width = 0.85, legend.outside.size=0.1)
OpenStreetMapData read with read_osm is static, so not usable in view mode. Please use tm_basemap or tm_tiles, with the provider name set to "OpenStreetMap.Mapnik"

Rental prices by room type and host type

listings_sf_price <- listings_sf_price %>% mutate(host_type = ifelse(calculated_host_listings_count ==1, "Single", "Multiple"))
tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(sg_osm, bbox=bb_sg_osm) +
  tm_rgb() +
tm_shape(nhood_map_sf) +
  tm_polygons(alpha = 0.3) +
  tm_shape(listings_sf_price) +
  tm_symbols(col = "host_type", shape = "price", size = 0.2, title.col = "Host Type", title.shape = "Price") +
  tm_facets(by="room_type")+
  tm_layout(legend.outside = TRUE, legend.outside.position = "bottom", legend.stack = "horizontal", legend.outside.size=0.1)
Symbol shapes other than circles or icons are not supported in view mode.
OpenStreetMapData read with read_osm is static, so not usable in view mode. Please use tm_basemap or tm_tiles, with the provider name set to "OpenStreetMap.Mapnik"

Chlorpleth map of listings and median

Summary of cleaned price listings by neighbourhood and room type

# Create a summary of the cleaned price listings by neighbourhood and room type (number of listings and median price)
listings_cleanprice_sum <- st_drop_geometry(listings_sf_price) %>%
  group_by(neighbourhood, room_type) %>%
  summarise(num_listings = n(), med_price = median(price), .groups = "keep") %>%
  arrange(desc(num_listings))

listings_join <- left_join(nhood_map_sf, listings_cleanprice_sum, by = c("neighbourhood"))
#take out all the NA values data set
listings_join <- na.omit(listings_join)
tmap_mode("plot")
tmap mode set to plotting
tmap_arrange(
tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(sg_osm, bbox=bb_sg_osm) +
  tm_rgb() +
  tm_shape(listings_join) +
   tm_polygons("med_price", title = "Median Price") +
   tm_view(set.zoom.limits = c(10, 18)) + 
   tm_facets(by="room_type", drop.NA.facets = T) +
   tm_layout(legend.outside = TRUE, legend.outside.position = "bottom", legend.outside.size = 0.2),
tm_basemap(leaflet::providers$OpenStreetMap) +
tm_shape(sg_osm, bbox=bb_sg_osm) +
  tm_rgb() +
   tm_shape(listings_join) +
   tm_polygons("num_listings", title = "No. of listings", palette = "Blues", alpha = 0.6) +
   tm_view(set.zoom.limits = c(10, 18)) + 
   tm_facets(by="room_type", drop.NA.facets = T) +
   tm_layout(legend.outside = TRUE, legend.outside.position = "bottom", legend.outside.size = 0.2)
)

Data that will be used for wrangling subzone map

listings to be converted from the listings_sf dataframe to SVY21 format (otherwise it will be out of bounds when using Spatstat) and converting to SpatialPoints (sp) #Remove listings from neighbourhoods that are not zoned for residential

listings_clean <- st_transform(listings_clean, 3414)
nonreslisting <- c("Tuas", "Pioneer")
listings_clean <- listings_clean %>% filter(!neighbourhood %in% nonreslisting)
listings_sp <- listings_clean %>% dplyr::select(geometry, room_type) %>% as(., Class = "Spatial")
summary(listings_sp)
Object of class SpatialPointsDataFrame
Coordinates:
               min      max
coords.x1 11133.06 45389.01
coords.x2 22713.81 48821.86
Is projected: TRUE 
proj4string :
[+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1
+x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0
+units=m +no_defs]
Number of points: 3443
Data attributes:
  room_type        
 Length:3443       
 Class :character  
 Mode  :character  
plot(listings_sp)

Wrangling subzone

  • MP14_SUBZONE_WEB_PL.shp
mpsz_sf <- st_read(dsn = "data/geospatial",
                layer = "MP14_SUBZONE_WEB_PL")
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `C:\Feliciaeng29\IS415-GAA\Take-home_Ex\Take-home_Ex03\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21
sg_osm <- read_osm(listings_clean, ext=1.3)

# Ensure that geometry is valid and check crs of subzone map
mpsz_sf <- st_make_valid(mpsz_sf)
all(st_is_valid(mpsz_sf))
[1] TRUE
crs(mpsz_sf)
[1] "PROJCRS[\"SVY21\",\n    BASEGEOGCRS[\"SVY21[WGS84]\",\n        DATUM[\"World Geodetic System 1984\",\n            ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n                LENGTHUNIT[\"metre\",1]],\n            ID[\"EPSG\",6326]],\n        PRIMEM[\"Greenwich\",0,\n            ANGLEUNIT[\"Degree\",0.0174532925199433]]],\n    CONVERSION[\"unnamed\",\n        METHOD[\"Transverse Mercator\",\n            ID[\"EPSG\",9807]],\n        PARAMETER[\"Latitude of natural origin\",1.36666666666667,\n            ANGLEUNIT[\"Degree\",0.0174532925199433],\n            ID[\"EPSG\",8801]],\n        PARAMETER[\"Longitude of natural origin\",103.833333333333,\n            ANGLEUNIT[\"Degree\",0.0174532925199433],\n            ID[\"EPSG\",8802]],\n        PARAMETER[\"Scale factor at natural origin\",1,\n            SCALEUNIT[\"unity\",1],\n            ID[\"EPSG\",8805]],\n        PARAMETER[\"False easting\",28001.642,\n            LENGTHUNIT[\"metre\",1],\n            ID[\"EPSG\",8806]],\n        PARAMETER[\"False northing\",38744.572,\n            LENGTHUNIT[\"metre\",1],\n            ID[\"EPSG\",8807]]],\n    CS[Cartesian,2],\n        AXIS[\"(E)\",east,\n            ORDER[1],\n            LENGTHUNIT[\"metre\",1,\n                ID[\"EPSG\",9001]]],\n        AXIS[\"(N)\",north,\n            ORDER[2],\n            LENGTHUNIT[\"metre\",1,\n                ID[\"EPSG\",9001]]]]"
mpsz_sf2 <- mpsz_sf %>% dplyr::select(SUBZONE_N, REGION_N)
mpsz_sf2 <- st_transform(mpsz_sf2, st_crs(listings_clean))
listings_sf2 <- listings_clean %>% dplyr::select(neighbourhood, room_type)
listings_subzones <- st_join(listings_sf2, mpsz_sf2, prepared=TRUE, join=st_within)
subzone_list <- unique(listings_subzones$SUBZONE_N)
mpsz_sf3 <- mpsz_sf %>% filter(SUBZONE_N %in% subzone_list)
plot(mpsz_sf3["SUBZONE_N"])

Splitting data into subregions

region_list <- unique(listings_subzones$REGION_N)

# Create a store of region SF
region_store_sf <- list()

# Create list of sub-regions
for (i in seq_along(region_list)) {
  region_name <- paste(tolower(word(region_list[i],1)))
  region_name <- gsub("-", "", region_name)
  var_name <- paste("sf", region_name, sep="_")
  region_store_sf[[region_name]] <- mpsz_sf3[mpsz_sf3$REGION_N == region_list[i],c("SUBZONE_N", "PLN_AREA_N")]
}

plot(region_store_sf$northeast)

# Convert to owin store, with subzone boundaries dissolved
region_store_owin <- list()

for (i in seq_along(region_list)) {
  region_name <- paste(tolower(word(region_list[i],1))) %>% gsub("-", "", .)
  var_name <- paste("sf", region_name, sep="_")
  region_store_owin[[region_name]] <- mpsz_sf3[mpsz_sf3$REGION_N == region_list[i],3] %>% st_union(.) %>% as(., Class = "Spatial") %>% as(., "owin")
}

# Plot owin to check
plot(region_store_owin$northeast)

# create ppp in owin objects 
listings_ppp <- as.ppp(listings_sp)
marks(listings_ppp) <- factor(listings_clean$room_type)
listing_ppp_jit <- rjitter(listings_ppp, retry = TRUE, nsim = 1, drop = TRUE)
# Create ppp in owin objects split by sub-regions, rescaled to km
ppp_east <- listing_ppp_jit[region_store_owin$east] %>% rescale(., 1000, "km")
ppp_central <- listing_ppp_jit[region_store_owin$central] %>% rescale(., 1000, "km")
ppp_west <- listing_ppp_jit[region_store_owin$west] %>% rescale(., 1000, "km")
ppp_north <- listing_ppp_jit[region_store_owin$north] %>% rescale(., 1000, "km")
ppp_northeast <- listing_ppp_jit[region_store_owin$northeast] %>% rescale(., 1000, "km")

# create object store split by region, then by room type of form ppp_store$region$roomtype
types_room <- unique(listings_sf$room_type)
ppp_store4 <- list()
for (i in seq_along(region_list)) {
  region_name <- paste(tolower(word(region_list[i],1))) %>% gsub("-", "", .)
  ppp_store4[[region_name]] <- list()
  for (j in seq_along(types_room)) {
    room_name <- paste(tolower(word(types_room[j],1)))
    var_name <- paste("ppp", region_name, sep="_")
    ppp_store4[[region_name]][["all"]] <- eval(as.name(var_name))
    ppp_store4[[region_name]][[room_name]] <- ppp_store4[[region_name]][["all"]] [ppp_store4[[region_name]][["all"]][["marks"]]==types_room[j]]
  }
}

First Order Anaylsis (Kernel Density Estimation)

Second Order Analysis

UI for Proposed Idea

Kernel Density Estimation

Second Order Analysis & Chloropleth Map